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Abstract 

We develop correlated random measures, random measures where the atom weights can ex¬ 
hibit a flexible pattern of dependence, and use them to develop powerful hierarchical Bayesian 
nonparametric models. Hierarchical Bayesian nonparametric models are usually built from 
completely random measures, a Poisson-process based construction in which the atom weights 
are independent. Completely random measures imply strong independence assumptions in the 
corresponding hierarchical model, and these assumptions are often misplaced in real-world set¬ 
tings. Correlated random measures address this limitation. They model correlation within the 
measure by using a Gaussian process in concert with the Poisson process. With correlated ran¬ 
dom measures, for example, we can develop a latent feature model for which we can infer both 
the properties of the latent features and their dependency pattern. We develop several other 
examples as well. We study a correlated random measure model of pairwise count data. We 
derive an efficient variational inference algorithm and show improved predictive performance 
on large data sets of documents, web clicks, and electronic health records. 

1. INTRODUCTION 

Hierarchical Bayesian nonparametric models (Teh and Jordan, 2010) have emerged as a power¬ 
ful approach to analyzing complex data (Williamson et al., 2010; Fox et al., 2011; Zhou et al., 
2012). These models assume there are a set of patterns, or components, that underlie the observed 
data; each data point exhibits each component with different non-negative weight; the number of 
components is unknown and new data can exhibit still unseen components. Given observed data, 
the posterior distribution reveals the components (including how many there are), reveals how 
each data point exhibits them, and allows for this representation to grow as more data are seen. 
These kinds of assumptions describe many of the most common hierarchical Bayesian nonpara¬ 
metric models, such as the hierarchical Dirichlet process (Teh et al., 2006), the Gamma-poisson 
process (Titsias, 2008), the beta-Bernoulli process (Thibaux and Jordan, 2007), and others. 
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For example, in Seetion 6 we analyze patient data from a large hospital; eaeh patient is de- 
seribed by the set of diagnostie eodes on her ehart. Potentially, the full data set refleets patterns in 
diagnostie eodes, eaeh pattern a set of diagnoses that often oeeurs together. Further, some patients 
will exhibit multiple patterns—they simultaneously suffer from different elusters of symptoms. 
With these data, a Bayesian nonparametrie model ean uneover and eharaeterize the underlying 
pattern and deseribe eaeh patient in terms of whieh patterns she exhibits. Reeent innovations in 
approximate posterior inferenee let us analyze sueh data at large seale, uneovering useful eharae- 
terizations of disease and injury for both exploration and predietion. In our study on medieal data, 
we diseover eomponents that summarize eonditions sueh as eongestive heart failure, diabetes, and 
depression (Table 1). 

But there is a limitation to the eurrent state of the art in Bayesian nonparametrie models. To 
eontinue with the example, eaeh patient is represented as an infinite veetor of non-negative weights, 
one per eomponent. (There is a eountably infinite number of eomponents.) Most hierarehieal 
Bayesian nonparametrie models assume that these weights are uneorrelated—that is, the presenee 
of one eomponent is unrelated to the presenee (or absenee) of the others. But this assumption is 
usually unfounded. For example, in the medieal data we find that type 2 diabetes is related to 
eongestive heart failure (Table 4). 

In this paper we solve this problem. We develop eorrelated random measures, a general- 
purpose eonstruetion for infusing eovarianee into the distribution of weights of both random mea¬ 
sures and hierarehieal Bayesian nonparametrie models. Our approaeh ean eapture that a large 
positive weight for one eomponent might eovary with a large positive weight in another, a type 
of pattern that is out of reaeh for most hierarehieal Bayesian nonparametrie models. We demon¬ 
strate that bringing sueh eorrelations into the model both improves predietion and reveals rieher 
exploratory struetures. Correlated random measures ean be used as a model for a eolleetion of 
observed weighted point proeesses and ean be adapted to a wide variety of proven Bayesian non¬ 
parametrie settings, sueh as language modeling (Teh, 2006), time series analysis (Fox et ah, 201 1), 
dietionary learning (Zhou et ah, 2009), and nested models (Paisley et ah, 2015). 

How do we aehieve this? Most Bayesian nonparametrie models are built on eompletely random 
measures (Kingman, 1967) and the independenee of the weights is an artifaet of this eonstruetion. 
To ereate eorrelated random measures, we infuse a Gaussian proeess (Rasmussen and Williams, 
2005) into the eonstruetion with a latent kernel between eomponents. This lets us relax the striet 
independenee assumptions. The details involve showing how to use the Gaussian proeess in eon- 
eert with the Poisson proeess, and without saerifieing the teehniealities needed to define a proper 
random measure. As a result of the general eonstruetion, we ean build eorrelated variants of many 
hierarehieal Bayesian nonparametrie models. 

We will deseribe four eorrelated random measures. The first is a eorrelated nonparametrie 
version of Poisson faetorization (Canny, 2004; Gopalan et ah, 2014). This is a model of eount 
data, organized in a matrix, and it will be the model on whieh we foeus our study. We show how to 
derive an effieient variational inferenee algorithm to approximate the posterior and use it to analyze 
both medieal data and text data. We also deseribe a eorrelated analog of the beta proeess (Hjort, 
1990) and two eorrelated binary latent feature models, eaeh expanding on the hierarehieal beta- 
Bemoulli proeess (Griffiths and Ghahramani, 2006; Thibaux and Jordan, 2007). We note that the 
diserete infinite logistie normal model in Paisley et al. (2012) is a normalized eorrelated random 
measure, a eorrelated adaptation of the hierarehieal Diriehlet proeess (Teh et al., 2006). 
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Related work. Correlated random measures (CorrRMs) ean capture general covariance between 
the measure of two sets, while also being atomic and extendible to hierarchical models that share 
atoms. In the Bayesian nonparametric literature, researchers have proposed several other random 
measures with covariance. We survey this work. 

Cox and Isham (1980) introduced the Cox process, a Poisson random measure whose mean 
measure is also stochastic. Cox processes can capture covariance if the stochastic mean measure 
exhibits covariance. Unlike CorrRMs, however, Cox processes do not allow for noninteger atom 
weights. Furthermore, most common examples of Cox processes, such as the log-Gaussian Cox 
process (Mpller et ah, 1998), do not allow for atom sharing. We note that CorrRMs share the 
doubly stochastic construction of the Cox process; in Appendix A.l we show that CorrRMs can be 
alternatively viewed as a stochastic transformation of a Poisson process. 

Determinantal point processes (Borodin, 2009) are point processes where the number of points 
is related to the determinant of a kernel function. Determinantal point processes exhibit “anti¬ 
dumping” (i.e., negative correlation) because atoms that are close together will not appear together. 
In contrast, hierarchical correlated random measures do not rely on the atom values to determine 
their correlation, and can capture both negative and positive correlation among their weights. 

Doshi-Velez and Ghahramani (2009) present a specific correlated feature model by positing a 
higher level grouping of features. In their model, observations exhibit correlation through these 
groups. However, groups only contain features and thus these feature can only express positive 
correlation. In the latent feature models based on CorrRMs, the latent locations of two features can 
induce a negative correlation between their co-occurrence. 

Ammann et al. (1978) study infinitely divisible random measures that are not completely ran¬ 
dom. These measures are referred to as having “aftereffects" in that every atom in the random 
measure has more effect on the measure than just its size. Correlated random measures are random 
measures with aftereffects, but they are not necessarily infinitely divisible. 

As we have mentioned, the discrete infinite logistic normal (DILN) (Paisley et ah, 2012), a 
Bayesian nonparametric topic model, is a normalized instance of a correlated random measure. 
DILN first generates top level shared atoms from a Dirichlet process, along with latent locations 
for each. It then draws each document with a gamma process from those atoms and a Gaussian 
process evaluated at their locations. Finally, it convolves these processes and normalizes to form a 
probability measure. We discuss DILN in detail in Section 4.3. 

Finally, there has been a lot of research in Bayesian nonparametrics about dependent random 
measures, originating from the work of MacEachem (1999), broadly surveyed in Foti et al. (2015), 
and used in applications such as for dynamic ordinal data (DeYoreo and Kottas, 2015), neuron 
spikes (Gasthaus et al., 2009), and images (Sudderth and Jordan, 2009). Dependent random mea¬ 
sures select atoms for each observation through a priori covariates, such as a timestamp associated 
with the observation. Atoms are correlated, but only though these observed covariates. The main 
ideas behind correlated random measures and dependent random measures are different. Corre¬ 
lations in CorrRMs are not based on side information, but rather are recovered through a random 
function associated with each observation. One dependent random measure that is close in con¬ 
struction to correlated random measures is the dependent Poisson process thinning measure of Foti 
et al. (2013). This measure can be reinterpreted as a type of correlated random measure; we dis¬ 
cuss this connection with technical details in Section 3. Another construction, compound random 
measures (Griffin and Leisen, 2014), builds dependent random measures by using a score func¬ 
tions to generate a set of measures conditional on a shared Poisson process. Compound random 
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measures and the dependent Poisson proeess thinning measure share with our approaeh the idea of 
separating out the atom generation from the independenee breaking portion. 

2. BACKGROUND: COMPLETELY RANDOM MEASURES 

In this seetion we review random measures (Kingman, 1967; Cinlar, 2011). We deseribe the Pois¬ 
son random measure, eompletely random measures, and normalized random measures. This sets 
the stage for our eonstruetion of the eorrelated random measure in Seetion 3. 

A random measure M is a stoehastie proeess that is indexed by a sigma algebra. Let (i7, S) be 
a measurable spaee, for example E is the real line and S are the Borel sets. A random measure is 
a eolleetion of random variables M{A) E [0, cxd], one for eaeh set A E E. The expeetation of a 
random measure is ealled the mean measure, whieh we denote i^{A) = E[M{A)]. 

One subelass of random measures is the elass of eompletely random measures (Kingman, 
1967). A eompletely random measure is a random measure M(-) sueh that for any disjoint fi¬ 
nite eolleetion of sets Ai, A 2 , ..., An, the eorresponding realizations of the measure on those sets 
M{Ai), M{A 2 ), ..., M{An) are independent random variables. Completely random measures en- 
eompass many of the eonstruetions in Bayesian nonparametrie statisties. Some examples inelude 
the Poisson proeess, the beta proeess (Hjort, 1990), the Bernoulli proeess (Thibaux and Jordan, 
2007), and the gamma proeess (Ferguson, 1973). 

We begin by deseribing the simplest example of a eompletely random measure, the Poisson 
random measure. The Poisson random measure is eonstrueted from a Poisson proeess. It is ehar- 
aeterized solely by its mean measure i'{-) : £ —)■ [0, cx)], whieh is an arbitrary measure on {E,£). 
The eomplete eharaeterization of a Poisson random measure M(-) is that the marginal distribution 
of M (A) is a Poisson with rate z^(A). 

We represent a Poisson random measure with a set of atoms Oj in E and a sum of delta measures 
on those atoms (Cinlar, 2011), 

00 

M{A) = 

i=l 

The delta measure Sa^iA) equals one when ai E A and zero otherwise. Note there ean be a 
eountably infinite set of atoms, but only if ^{E) = cx).^ The distribution of the atoms eomes from 
the mean measure For eaeh finite measurable set A, the atoms in A are distributed aeeording 

to u{-)/u{A). 

We now expand the simple Poisson random measures to eonstruet more general eompletely 
random measures. Consider a Poisson proeess on the eross produet of E and the positive reals, 
E X ]R+. It is represented by a set {(a*, 10 *)}“^; eaeh pair eontains an atom ai and eorresponding 
weight Wi E M+. The eompletely random measure is 

00 

M{A) = E wAM)- (1) 

i=l 

This Poisson proeess is eharaeterized by its mean measure, ealled the Levy measure, whieh is 
defined on the eorresponding eross produet of sigma algebras, 

z/(-, ■) : E X -B(M+) —)■ [0, 00 ]. 

'This fact follows from the marginal distribution M{E) ~ Poisson(i/(L)) and that a Poisson random variable with 
rate equal to 00 is 00 almost surely. 
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We note that completely random measures also have fixed components, where the atoms are fixed 
in advance and the weights are random. But we will not consider fixed components here. 

We call the process homogenous when the Levy measure factorizes, i/(y4, R) = H{A)u{R)-, we 
call H the base measure. For example, in a nonparametric mixture of Gaussians the base measure 
is a distribution on the mixture locations (Escobar and West, 1995); in a nonparametric model of 
text, the base distribution is a Dirichlet over distributions of words (Teh et ah, 2006). 

We confirm that M(-) in Equation 1 is a measure. Eirst, M(0) = 0. Second, M{A) > 0 
for any A. Einally, M(-) satisfies countable additivity. Define A to be the union of disjoint sets 
{Ai, A 2 ,...}. Then M{A) = This follows from a simple argument, 

CO 00 00 00 00 CO 

6aXAk) = = M{A). 

k=l k=l i=l i=l k=l i=l 

We used Tonelli’s theorem to interchange the summations. 

One example of a completely random measure is the gamma process (Eerguson, 1973). It has 
Eevy measure 

iy{da,dw) = H{da)e~^'^/wdw. 

This is called the gamma process because if M ~ Gamma-Process(77, c) the random measure 
M{A) on any sei A E 8 is gamma distributed M{A) ~ Gamma(77(74), c), where H{A) is the 
shape and c is the rate (Cinlar, 2011). The gamma process has an infinite number of atoms—its 
Eevy measure integrates to infinity—but the weights of the atoms are summable when the base 
measure is finite (77(77) < 00 ) because E[M(77)] = Einally, when M{E) < 00 , we can 

normalize a completely random measure to obtain a random probability measure. Eor example, we 
construct the Dirichlet process (Eerguson, 1973) by normalizing the gamma process. 

3. CORRELATED RANDOM MEASURES 

The main limitation of a completely random measure is articulated in its definition—the random 
variables M{Ai) are independent. (Because they are normalized, random probability measures ex¬ 
hibit some negative correlation between the M{Ai), but cannot capture other types of relationships 
between the probabilities.) This limitation comes to the fore particularly when we see repeated 
draws of a random measure, such as in hierarchical Bayesian nonparametric models (Teh and Jor¬ 
dan, 2010). In these settings, we may want to capture and infer a correlation structure among 
M{Ai) but cannot do so with the existing methods (e.g., the hierarchical Dirichlet process). To 
this end, we construct correlated random measures. Correlated random measures build on com¬ 
pletely random measures to capture rich correlation structure between the measure at disjoint sets, 
and this structure can be estimated from data. 

We built completely random measures from a Poisson process by extending the space from 
simple atoms (in the Poisson process) to the space of atoms and weights (in a completely random 
measure). We build correlated random measures from completely random measures by extending 
the space again. As for a completely random measure, there is a set of atoms and uncorrelated 
weights. We now further supply each tuple with a “location”, a vector in M'^, and extend the mean 
measure of the Poisson process appropriately. A correlated random measure is built from a Poisson 
process on the extended space of tuples {(a*, tUj, 7^)}“]^. 
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In the completely random measure of Equation 1, the uncorrelated weights Wi give the measure 
at each atom. In a correlated random measure there is an additional layer of variables Xi, called 
the transformed weights. These transformed weights depend on both the uncorrelated weights Wi 
and a random function on the locations F{ii). In the random measure, they are used in place of 
the uncorrelated weights, 


OO 

M{A) = 5:xa.(a). (2) 

i=l 

It is through the random function F{-), which is drawn from a Gaussian process (Rasmussen and 
Williams, 2005), that the weights exhibit correlation. 

We first review the Gaussian process (GP) and then describe how to construct the transformed 
weights. A Gaussian process is a random function F^if) from —)■ M. It is specified by a 

positive-definite kernel function^ K {ii, if and mean function /i(£j). The defining characteristic of 
a GP is that each joint distribution of a collection of values is distributed as a multivariate normal. 


(F(4),...,F(C))~Ar(m,S), 
where rrii = /r(('i) and = K{ii, if. 

In a correlated random measure, we draw a random function from a GP, evaluate it at the 
locations of the tuples ii, and use these values to define the transformed weights. We specify 
the transformation distribution of Xi G M"*" denoted T(xj \ wi,F{ii)). It depends on both the 
uncorrelated weights Wi and the GP evaluated at ii. For example, one transformation distribution 
we will consider below is the gamma. 


a:* ~ Gamma(tei,exp{—F(('j)}). (3) 

But we will consider other transformation distributions as well. What is important is that the Xi are 
positive random variables, one for each atom, that are correlated through their dependence on the 
GP F. 

We have now fully defined the distribution of the transformed weights Xi that are used in the 
correlated random measure of Equation 2. We emphasize that in a completely random measure the 
weights are independent. The arguments that M(-) is a measure, however, only relied on its form, 
and not on the independence of the weights. (See Equation 2.) 

In summary, we build a correlated random measure by specifying the following: the mean 
measure of a Poisson process on atoms, weights, and locations u(da, dw, di); a kernel function 
K{ii, if between latent locations and a mean function m{ii); and the conditional transformation 
distribution T(- | Wi, F{ii)) over positive values. With these elements, wedraw a correlated random 
measure as follows: 


{{ai,Wi,ii)}Zi ~ 

P' ~ 

Xi ~ 

The random measure M(-) is in Equation 2. 


Poisson-Process (i/) 

(4) 

Gaussian-Process(m, K) 

(5) 

T{-\w,,F{i,)). 

(6) 


^This means that for a finite collection of inputs, the kernel produces a positive definite matrix. 
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Before turning to some eonerete examples, we set up some useful notation for eorrelated ran¬ 
dom measures. We denote the infinite set of tuples from the Poisson proeess (Equation 4) with 

Given these tuples, the proeess for generating a eorrelated random measure first draws from a Gaus¬ 
sian proeess (Equation 5), then transforms the weights (Equation 6), and finally eonstruets the mea¬ 
sure from an infinite sum (Equation 2). We shorthand this proeess with M ~ CorrRM(C, K, /i, T)? 

We note that eorrelated random measures generalize eompletely random measures. Speeifi- 
eally, we ean eonstruet a eompletely random measure from a eorrelated random measure by setting 
the mean measure u{dx, dw, di) to mateh the eorresponding eompletely random measure mean 
measure u^dx, dw) (i.e., the loeation distribution does not matter) and asserting that Xi = Wi with 
probability one. 

With the full power of the eorrelated random measure, we ean eonstruet eorrelated versions of 
eommon random measures sueh as the gamma proeess, the beta proeess, and normalized measures 
sueh as the Diriehlet proeess. We give two examples below. 

Example: Correlated Gamma process. We diseussed the gamma proeess as an example of a 
eompletely random measure. We now extend the gamma proeess to a eorrelated gamma proeess. 
Eirst, we must extend the mean measure to produee atoms, weights, and loeations. We speeify an 
additional distribution of loeations L{i) —we typieally use a multivariate Gaussian—and expand 
the mean measure of the gamma proeess to a produet, 

v{da,dw,dt) = L{di)H{da)e~^'^/wdw. 

Seeond, for the transformation distribution, we ehoose the gamma in Equation 3. Einally, we define 
the GP parameters, the kernel K (fj, ij) = ijij and a zero mean /r(^i) = 0. With these eomponents 
in plaee, we draw from Equation 4, Equation 5, and Equation 6. This is one example of a eorrelated 
random measure. 

Example: Correlated Diriehlet process. We ean normalize the measure to eonstruet a cor¬ 
related random probability measure from a eorrelated random measure. If M(-) is a eorrelated 
random measure, then 

G{A) = M{A)/M{E) 

is a eorrelated random probability measure. As we diseussed in Seetion 2, the Diriehlet proeess is a 
normalized gamma proeess (Eerguson, 1973). When we normalize the eorrelated gamma proeess, 
we obtain a eorrelated Diriehlet proeess. 

This eonstruetion requires that M{E) < oo. Proposition 2 in Appendix A.2 deseribes eondi- 
tions for well-defined normalization (i.e., M{E) < oo) in eorrelated random measures. Roughly, 
these eonditions require finiteness of the expeeted value of the transformed weights against the 
produet of the transformation distribution and the Levy measure. This eondition plays a eentral 
role in eonstrueting useful Bayesian nonparametrie models. 

^As in the construction of completely random measures, correlated random measure can also include fixed com¬ 
ponents, where the tuples are fixed, but the Xi are random. 
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The correlation structure. Finally, we calculate the correlation structure of a correlated ran¬ 
dom measure. Consider a measure, M ~ CorrRM(C, K, m, T). To understand the nature of the 
correlation, we compute Cov(M(y4), M{B)) for two sets A and B. 

We express this covariance in terms of the covariance between atom weights, 

oo oo 

Cow{M{A), M{B)) = EE Cow{X„X,)6,^{A)6,^{B). 

i=l j=l 

In other words, the covariance between the measure of two sets is the sum of the covariances 
between the transformed weights of the atoms in the two sets. In a completely random measure, 
the atom covariance is zero except when a* = Uj. Thus its covariance depends on the overlap 
between sets, and nothing else. In a correlated random measure, however, there may be non-zero 
covariance between the transformed weights. 

For now we are holding the underlying Poisson process C fixed, i.e., the atoms, untransformed 
weights, and locations. The covariance between transformed weights is 


Cov(Xi, X, I C) = E[X,Xj] - E[Xi]E[Xj]. (7) 

These expectations are driven by two sources of randomness. First there is a Gaussian process F, 
a random function evaluated at the fixed locations. Second there is the transformation distribution 
T. This is a distribution of an atom’s transformed weight, conditional on its untransformed weight 
and the value of the Gaussian process at its location (see Equation 6). 

Using iterated expectation, we write the conditional covariance in Equation 7 in terms of the 
conditional mean of the transformed weights, fii = E[Xi \ F{li)^Wi]. This is a function of the 
Gaussian process F. We rewrite the conditional covariance. 


Cov(Xi, Xj I C) = ElfXifXj] - E[fXi]E[nj], (8) 

where the expectations are taken with respect to the Gaussian process. For the first term, the 
distribution is governed by the distribution of the pair {F{ii), F{ij)), which is a bivariate normal. 
For the second term, the marginals are governed by F{ii) and F{ij), which are univariate normal 
distributions. 

4. HIERARCHICAL CORRELATED RANDOM MEASURES 

A correlated random measure takes us from a set of tuples to a random measure by way of a 
Gaussian process and a transformation distribution. When used in a downstream model of data, 
we can infer the latent correlation structure from repeated realizations of measures from the same 
set of tuples. It is thus natural to build hierarchical correlated random measures. Hierarchical 
correlated random measures are the central use of this new construction. 

In a hierarchical correlated random measure, we first produce a set of tuples {(a*, Wi, 
from a Poisson process and then re-use that set in multiple realizations of a correlated random 
measure. In each realization, we fix the tuples (weights, atoms, and locations) but draw from the 
Gaussian process anew; thus we redraw the transformed weights for each realization. 

As for the simple correlated random measure, we first specify the mean measure of the Poisson 
process z/(-), the kernel and mean for the Gaussian process K{-, •), and the conditional transfor¬ 
mation distribution T(- | Wi, G{ii)). We then draw n hierarchical correlated random measures as 
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follows: 


C ~ Poisson-Process(i/) 

Mj{A) ~ CorrRM(C, m, T). 

This is a hierarchical Bayesian nonparametric model (Teh and Jordan, 2010). There are mul¬ 
tiple random measures Mj. Each shares the same set of atoms, locations, and weights, but each 
is distinguished by its own set of transformed weights.^ The correlation structure of these trans¬ 
formed weights is shared across measures. We note that this construction generalizes the discrete 
infinite logistic normal (Paisley et ah, 2012), which is an instance of a normalized correlated ran¬ 
dom measure. 

We use this construction in a model of groups of observations yj, for which we must construct a 
likelihood conditional on the correlated RM. To construct a likelihood, many hierarchical Bayesian 
nonparametric models in the research literature use the integral with respect to the random measure. 
(This is akin to an unnormalized “expectation.”). Define Ma = f aM(da), and note that in a 
discrete random measure this integral is an infinite sum, 

Ma = '^^Xiai. (9) 


The jth observations are drawn from a distribution parameterized from this sum, yj ~ p(-1 Ma). 
For example, we will study models where Ma is a collection of rates for independent Poisson 
distributions. 

We present several examples of hierarchical correlated random measures. First, we develop 
correlated nonparametric Poisson factorization (CNPF) for factorizing matrices of discrete data. 
This is the example we focus on for posterior inference (Section 5) and our empirical study (Sec¬ 
tion 6). We then illustrate the breadth of correlated random measures with two other examples, both 
of which are latent feature models that build correlations into the class of models introduced by 
Griffiths and Ghahramani (2006). Finally, we discuss the discrete infinite logistic normal (DIFN) 
of Paisley et al. (2012). We show that DIFN is a type of normalized correlated random measure. 

4.1. Correlated Nonparametric Poisson Factorization 

Bayesian nonparametric Poisson matrix factorization (Gopalan et ah, 2014) combines gamma pro¬ 
cesses (Ferguson, 1973) with Poisson likelihoods to factorize discrete data organized in a matrix. 
The number of factors is unknown and is inferred as a consequence of the Bayesian nonparametric 
nature of the model. 

For concreteness we will use the language of patients getting diagnoses (e.g., patients going to 
the hospital and getting marked for medical conditions). In these data, each cell of the matrix y^j is 
the number of times patient u was marked for diagnosis j. The goal is to factorize users into their 
latent “health statuses” and factorize items into their latent “condition groups”. These inferences 
then let us form predictions about which unseen codes a patient might have. Though we focus 
our attention here on patients getting diagnoses, we emphasize that discrete matrices are widely 

our empirical study of Section 6 , we will also endow each with its own mean function to the Gaussian process, 
rrijf). Here we omit this detail to keep the notation clean. 
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found in modem data analysis problems. In our empirieal study, we will also examine matriees of 
doeuments (rows) organized into word eounts (eolumns) from a fixed voeabulary and user (rows) 
elieks over a fixed eolleetion of items (eolumns). 

We will use a hierarehieal eorrelated random measure to model these data, where eaeh group 
is a patient and the group-speeifie data are her veetor of per-diagnosis eounts. An atom Oj is 
a veetor of positive weights for eaeh diagnosis, drawn from independent gamma distributions, 
H{a) = Gamma(a,/3). When the posterior of these atoms is estimated from diagnosis eounts, 
they will represent semantie groups of eonditions sueh as “diabetes," “heart disease," or ‘eaneer.” 
Table 1 displays some of the atoms inferred from a set of patients from the Mayo Clinie. 

In using a eorrelated random measure, the idea is that patients’ expression for these eonditions 
are represented by the per-group weights Xj. Intuitively, these exhibit eorrelation. A patient who 
has “heart” eonditions is more likely to also have “vaseular” eonditions than “eaneer.” (To be 
elear, these groupings are the latent eomponents of the model. There are an unbounded number 
of them, they are diseovered in posterior inferenee, and their labels are not known.) Using these 
eorrelations, and based on her history, a eorrelated model should better prediet whieh diagnoses a 
patient will have. 

We now set up the model. We set the mean measure for the shared atoms to be 


iy{da, dw, di) = Gamma{da, a, /3)Normal((i£, 0, /wdw. 


We define the GP mean funetion to be a per-patient eonstant mu{ii) = /i^, where /i« ~ Af{0, cr^). 
These per-patient GP means aeeount for data where some patients tend to be sieker than others. 
We define the GP kernel funetion to be K (£j, ij) = ijij. 

Finally, we eonsider two different transformation distributions. The first transformation distri¬ 
butions is as in Equation 3, 


Xi ~ Gamma(tUi,exp{—F(('j)}). 


The seeond is 



where log(l -I- exp(-)) is known as the softplus funetion. With these definitions, we ean eompute 
the eonditional eovarianee for Xi and Xj using Equation 8. Table 4 displays some of positive 
eorrelations between atoms found on patient diagnosis eounts. These eorrelations are eaptured by 
the loeations, whieh are shared aeross patients, assoeiated with eaeh atom. Atoms with positive 
eovarianee in this model will have inferred loeations that have a large inner produet. 

With these eomponents in plaee, eorrelated nonparametrie Poisson faetorization is 


C ~ Poisson-Proeess(i/) 

Mu ~ CorrRM(C, iJ,u,K,T) 

Uu ~ p{- I MuO). 

The distribution of pu is a eolleetion of Poisson variables, one for eaeh diagnosis j, where 



( 10 ) 
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Recall that the atoms ai are each a vector of gamma variables, one per diagnosis, and so aij is the 
value of atom i for diagnosis j. For this model to be well defined each rate in Equation 10 must be 
finite. Using proposition 2 in Appendix A.2, it is finite almost surely if a; < 1. 

The sum that defines the rate of yui is an infinite sum of patient weights and condition weights. 
Thus, this model amounts to a factorization distribution for y^i- Given observed data, the posterior 
distribution of the atoms and transformed patient weights Xi gives a mechanism to form predic¬ 
tions. Note that the atoms a* are shared across patients, but through Xi each patient exhibits them 
to different degree. We discuss how to approximate this posterior distribution in Section 5. 

4.2. Correlated latent feature models 

Mixture models, such at the Dirichlet process mixture (Antoniak, 1974), are the most commonly 
used Bayesian nonparametric model. In mixture models, each observation exhibits only a single 
class. Many data, such as images of multiple objects, are better characterized as belonging to mul¬ 
tiple classes. Latent feature models posit that each observation is associated with some number 
of latent classes, taken from a set of features shared by all observations. For each observation, its 
likelihood depends on parameters attached to its active features (e.g., a sum of those parameters). 
Examples of latent feature models include factorial mixtures (Ghahramani, 1995) and factorial hid¬ 
den Markov models (Ghahramani and Jordan, 1997). (Latent feature models are closely connected 
to spike and slab priors (Ishwaran and Rao, 2005).) 

Bayesian nonparametric latent feature models allow the number of features to be unbounded. 
As an example, consider analyzing a large data set of images. Latent features could correspond 
to image patches, such as recurring objects that appear in the images. In advance, we might not 
know how many objects will appear in the data set. BNP latent feature models attempt to solve 
this problem. BNP latent feature models have been used in many domains such as image denoising 
(Zhou et ah, 201 1) and link prediction in graphs (Miller et ah, 2009). 

The most popular BNP latent feature model is the hierarchical beta-Bemoulli process (Thibaux 
and Jordan, 2007). This process was originally developed as the Indian Buffet process, which 
marginalized out the beta process (Griffiths and Ghahramani, 2006). Before developing the corre¬ 
lated version, we review the beta-Bemoulli process. 

The beta process is a completely random measure with atom weights in the unit interval (0,1). 
Its Levy measure is 


p^da^dw) = H{da)aw ^(1 —tu)“ ^ 

We use the beta process in concert with the Bernoulli process, which is a completely random 
measure parameterized by a random measure with weights in the unit interval, i.e., a collection 
of atoms and corresponding weights. A draw from a Bernoulli process selects each atom with 
probability equal to its weight. This forms a random measure on the underlying space, where each 
weight is one or zero (i.e., where only a subset of the atoms are activated). Returning to latent 
feature models, the beta-Bernoulli process is 

B ~ Beta-Process(i7, a) 

Bn ~ Bemoulli-Process(iJ) 
yn ~ p(- I Bn) 
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The beta process generates the feature atoms; the Bernoulli processes chooses which features are 
active in each observation. 

This model is built on completely random measures. Thus, the appearances of features in 
each observation are independent of one other. Correlated random measures relax this assumption. 
Consider a latent feature model of household images with image patch features. The completely 
random assumption here implies that the appearance of a spoon is independent of the appearance 
of a fork. Our construction can account for such dependencies between the latent features. Below 
we will give two examples of correlated nonparametric latent feature models, one based on the 
beta process and the other based on the gamma process. 

One method to develop a correlated beta-Bernoulli process is to define transformed weights at 
the Bernoulli process level. We define the transformation distribution to be 


Bernoulli(cT((T + F(£j))), 


where a is the sigmoid function a(x) = 1/(1 + exp{—a;}). Thus the beta-Bernoulli correlated 
latent feature model is 


C ~ Poisson-Process (if (da) L((i£) ate ^(1 —te)" 
Mn ~ CorrRM(C,/t, iT,T). 


(We defer defining /t and K, as they will be application specific.) 

We do not need to use the beta process to define a correlated latent feature model; what is 
important is that the per-observation weights are either one or zero. For example, if the top level 
process is a gamma process, which produces positive weights, then we can define the transforma¬ 
tion distribution to be 



The resulting gamma-Bemoulli correlated latent feature model is 


C ~ Poisson-Process (if (da) L(d(')e ^'^/wdw) 
Mn ~ ConRM{C, fx,K,T). 


The beta-Bernoulli process uses only a finite number of features to generate a finite number of 
observations. In Appendix A.3, we give some conditions under which the correlated latent feature 
models do the same. 

4.3. Discrete infinite logistic normal 

The correlated random measure construction that we developed generalizes the discrete infinite 
logistic normal (DILN) (Paisley et ah, 2012). DILN is an example of a normalized hierarchical 
correlated random measure; its atom weights come from a normalized gamma random measure, 
i.e., a Dirichlet process. 

DILN was developed as a Bayesian nonparametric mixed-membership model, or topic model, 
of documents. In DILN, each document mixes a set of latent topics (distributions over terms), 
where the per-document topic proportions can exhibit arbitrary correlation structure. This is in 
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contrast to a hierarchical Dirichlet process topie model (Teh et ah, 2006), where the topic propor¬ 
tions are nearly independent. 

We will express DILN in terms of a eorrelated random measure. The observations Wuj are 
eategorieal variables, i.e., word j in doeument u. Set the kernel K{ii, ij) = ijij, and set a and (3 
to be positive hyperparameters. Set the transformation distribution to be 

Xi ~ Gamma(/3t(;j,exp{—F(£i)}). 


With our construetion, DILN is 

C ~ Diriehlet-Process(a, iL((ia) X 0, erf Jd)) 

Mu ~ Normalized-CorrRM(C, 0, iL, T) 

^uj Mu 

Wuj ^ Zuj • 

Note that the shared tuples come from a Dirichlet process, i.e., a normalized gamma process. When 
modeling doeuments, the base distribution over atoms H(da) is a Diriehlet distribution over the 
voeabulary. 

This is a mixed-membership model—there is an additional layer of hidden variables Zuj, drawn 
from the random measure, before drawing observations Wuj- These hidden variables Zui will be 
atoms, i.e., distributions over the voeabulary, from the set of shared tuples and drawn with prob¬ 
ability according to the per-doeument transformed weights. Each observation Wuj is drawn from 
the distribution over terms given in its atom Zuj. 

Paisley et al. (2012) show the normalization step is well defined when ai < 1. Viewing DILN 
through the lens of eorrelated random measures makes elear what ean be ehanged. For example, 
the top level ehoiee of the Diriehlet proeess is not oritieal. It eould be any random measure that 
plaees finite total mass, sueh as a gamma proeess or a beta proeess. 

4.4. Connection to Dependent Random Measures 

Finally, we diseuss the detailed eonneetion between eorrelated random measures and dependent 
random measures MacEachem (1999). Dependent random measures are a eolleetion of measures 
indexed by eovariates. A broad elass of dependent random measures ean be ereated by thinning a 
Poisson proeess (Foti et ah, 2013). Given a draw from a Poisson proeess (a*, tCj, where a 

are atoms, i are loeations in the eovariate spaee, and w are weights, the thinned dependent random 
measure B for user u with eovariate 9u is 

OO 

Bu{A) = '^XuiSajA) 

i=l 

Xui ~ w3B,&mo\i\\i{k{9u,^i)), 

where A; is a function from T x L —)■ [0,1]. This construction is related to CorrRMs. Consider the 
correlated random measure 


Xui ~ tCiBemoulli(cT(F(fj))) 

Mu{A) ~ CorrRM((aj,M;i,£i)“i,m,iL,T). 
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From, this we can see that Bu{A) = Mu{A) when a{Fu{ii)) = k{6u, U)- In other words, thinned 
dependent random measures are equivalent to a type of correlated random measure where the 
random function, Fu associated with each user is known and given by the covariate. 

We note that dependent random measures map from covariates to measures. Thus they can be 
viewed as a type of measure-valued regression. In parallel, correlated random measures use latent 
covariates. In this sense, they can be viewed as measure-valued 

5. VARIATIONAL INFERENCE EOR CORRELATED 
NONPARAMETRIC POISSON EACTORIZATION 

Computing the posterior is the central computational problem in Bayesian nonparametric model¬ 
ing. However, computing the posterior exactly is intractable. To approximate it, we use variational 
inference (Jordan et ah, 1999; Wainwright and Jordan, 2008), an alternative to Markov chain Monte 
Carlo. Variational inference has been used to approximate the posterior in many Bayesian nonpara¬ 
metric models (Kurihara et ah, 2007; Doshi-Velez et ah, 2009; Paisley and Carin, 2009; Wang and 
Blei, 2012) and has been of general interest in statistics (Braun and McAuliffe, 2007; Faes et ah, 
2011; Ormerod and Wand, 2012). Here we develop a variational algorithm for correlated nonpara¬ 
metric Poisson matrix factorization (Section 4.1). 

Variational inference turns approximate posterior computation into optimization. We set up 
a family of distributions over the latent variables Q = {q'(-)} ^nd then find the member that 
minimizes the KL divergence to the exact posterior. Minimizing the KL divergence to the posterior 
is equivalent to maximizing the Evidence Lower BOund (ELBO), 

q*{-) = argmaxEq^(^)[logp(?/,0 - logg(0], (H) 

q&Q 

where ^ are the latent variables and y are the observations. In this paper we work with the mean- 
field family, where the approximating distribution fully factorizes. Each latent variable is indepen¬ 
dently governed by its own variational parameter. 

To develop a variational method for CNPE, we give a constructive definition of the gamma pro¬ 
cess and introduce auxiliary variables for the Gaussian process. We then define the corresponding 
mean-field family and show how to optimize the corresponding ELBO. 

Additional latent variables. We first give a constructive definition of a homogeneous gamma 
process. We scale the stick breaking construction of Sethuraman (1994) as used in Gopalan et al. 
(2014); Zhou and Carin (2015). We define stick lengths Vk from a beta distribution and a scaling s 
from a gamma distribution. The weights of the gamma process Wk are from the following process, 

S Gamma(a, c) 

Vk ~ Beta(l, a) 

Wk = s - ^’j)) ■ 

We treat the gamma shape a and rate c as latent variables (with gamma priors). 

We adapt the auxiliary variable representation of zero-mean Gaussian processes with linear 
kernels (Paisley et ah, 2012) to more general Gaussian processes. Suppose Gn is a Gaussian 
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process with mean /i„ and a linear kernel. Let d be a standard Gaussian veetor with same dimension 
as ik- We ean write the proeess as 


G{ik)=iU + f^n{ik). 

This lets us evaluate likelihoods without matrix inversion. 

The mean-field family. With the latent variables for the gamma and Gaussian proeesses in hand, 
we now define the mean-field variational distribution. We use the following approximating family 
for eaeh latent variable 


q{xku) = Gamma{al^, 

q{aki) = Gamma(afci,/3fci) 

q{s)q{Vk)q{ik) = 
q{dMHu) = 
q{a)q{c) = S&Sc, 


where 6r represents a point mass at r. As in prior work on variational inferenee for Bayesian 
nonparametries, we use delta distributions in the top level stiek eomponents, sealing, and hyperpa¬ 
rameters for analytie traetability (Liang et ah, 2007; Paisley et ah, 2012; Gopalan et ah, 2014).^ 

Bayesian nonparametrie models eontain an infinite number of latent variables. Following Blei 
and Jordan (2005), we truneate the variational approximation of the stieks 14 and assoeiated tuples 
to T. In praetiee it is straightforward to reeognize if the truneation level is too small beeause all of 
the eomponents will be populated in the fitted variational distribution. In our studies T = 200 was 
suffieient (Seetion 6). 

The goal of variational inferenee is to find the variational parameters—the free parameters of 
q, sueh as ik —that maximize the evidenee lower bound. In Appendix A.4, we deseribe how to 
optimize the ELBO (Equation 11) with stoehastie variational inferenee (Hoffman et ah, 2013). 
Code will be made available on GitHub. 

We have derived a variational inferenee algorithm for one example of a eorrelated random 
measure model. Deriving algorithms for other examples follows a similar reeipe. In general, we 
ean handle inferenee for eovarianee funetions with indueing variables (Titsias, 2009) and subsam¬ 
pling (Hensman et ah, 2013). Eurther, we ean address models with intraetable expeetations—e.g., 
those arising from different transformation distributions or Eevy measures—with reeent methods 
for generie and noneonjugate variational inferenee (Salimans et ah, 2013; Ranganath et ah, 2014; 
Wang and Blei, 2013). 


6. EMPIRICAL STUDY 

We study eorrelated nonparametrie Poisson faetorization (CNPE) and eompare to its uneorrelated 
eounterpart on a large text data set and a large data set of medieal diagnosis eodes. Quantitatively, 
we find that the eorrelated model gives better predietive performanee. We also find that it reveals 
interesting visualizations of the posterior eomponents and their relationships. 

^This corresponds to variational expectation-maximization, where the E step computes variational expectations 
and the M step takes MAP estimates of the latent variables with delta factors (Beal, 2003). 
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6.1. Study Details 

Before giving the results, we deseribe the baseline models, the evaluation metrie, and the hyperpa¬ 
rameter settings. 

Baseline models. As a baseline, we eompare against the uneorrelated variant of Bayesian non- 
parametrie Poisson faetorization. As we mentioned in Seetion 3, uneorrelated random measures 
can be cast in the correlated random measure framework by setting a transformation distribution 
that does not depend on the Gaussian process. 

Recall that Xik is the weight for data point i on component k. In the simplest Bayesian non- 
parametric Poisson factorization model, the transformation distribution is 


Xik ~ Gamma(r(;fc, 1). 


This is a two-layer hierarchical gamma process, and we abbreviate this model HGP The first 
layer contains shared atoms and weights. The second layer is a gamma process for each data point 
(e.g., patient or document), with base measure given by the first layer’s measure. 

The second uneorrelated model places further hierarchy on the log of the scale parameter of 
the Gamma, 


Xik ~ Gamma (tffc, exp(—mi)). 


Here rrii ~ Normal(a, b), which captures variants in the row sums for each data point (i.e., how 
many total diagnoses for a patient or how many words for a document). We call this model the 
scaled HGP. 

Appendix A.5 gives inference details for both uneorrelated models. 

Evaluation metric. We compare models with held out perplexity, a standard metric from infor¬ 
mation retrieval that relates to held out predictive likelihood (Geisser, 1975). We use the partial 
observation scenario that is now common in topic modeling (Wallach et ah, 2009). The idea is to 
uncover components from most of the data, and then evaluate how well those components can help 
predict held out portions of new partially-observed data. 

For each data set, we hold out 1,000 examples (i.e., rows of the matrix). From the remaining 
examples we run approximate posterior inference, resulting in approximate posterior components 
E[afc] that describe the data. With the 1,000 held out examples, we then split each observation (i.e., 
columns) randomly into two parts, 90% in one part (j/test) and 10% in the other (j/obs)- We condition 
on the j/obs (and that there is a test word) and calculate the conditional perplexity on j/test- A better 
model will assign the true observations a higher probability and thus lower perplexity. Formally, 
perplexity is defined as 



Perplexity measures the average surprise of the test observations. The exponent is the average 
number of nats (base e bits) needed to encode the test sample. 
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Figure 1: A graph of the latent component correlations found on The New York Times. This figure is best viewed on a computer with 
zoom. The sizes of the components are related to their frequency. The correlation structure consists of several tightly connected groups 
with sparse links between them. 




































1: Mammogram, Routine medical exam, Lumbago, Cervical Cancer Screening, Hypothyroidism 
2: Hypertension, Hyperlipidemia, Coronary atherosclerosis. Prostate Cancer Screening, Vaccine for influenza 
3: Acute pharyngitis. Cough, Myopia, Vaccine for influenza. Joint pain-shlder 
4: Child Exam, Vaccine for flu. Otitis media. Upper respiratory infection, pharyngitis 

5: Long-term anticoagulants. Atrial fibrillation. Hypertension, Congestive Heart Failure, Chronic airway obstruction 

6: Normal pregnancy. Normal first pregnancy. Cervical cancer screening. Delivery, Conditions antepartum 

7: Diabetes-2, Hypertension, Hyperlipidemia, Uncontrolled Diabetes-2, Diabetes-2 with ophthalmic manifestations 

8: Depression, Dysthymia, Anxiety state. Generalized anxiety disorder. Major depressive affective disorder 

9: Joint pain lower leg. Arthritis lower leg. Local arthritis lower leg. Post-procedural status. Follow-up surgery 

10: Allergic rhinitis. Desensitization to allergens. Asthma, Chronic allergic conjunctivitis. Chronic sinusitis 

11: Heart valve replacement. Prostate cancer. Lung and bronchus cancer. Secondary bone cancer. Other lung disease 

12: Morbid obesity. Obesity, Obstructive sleep apnea. Sleep apnea. Intestinal bypass status 

13: Acne, Convulsions, Abnormal involuntary movements. Cerebral palsy. Long-term use meds 

14: Abnormality of gait. Personality change. Persistent mental disorders. Lack of coordination. Debility 

15: Attention disorder w hyperactivity. Attention disorder w/o hyperactivity. Adjustment disorder. Opposition defiant disorder. Conduct disturbance 
16: Diseases of nail. Corns and callosities, Dermatophytosis of nail. Ingrowing nail. Other states following surgery of eye and adnexa 
17: Alcohol dependence. Tobacco use disorder. Alcohol abuse. Other alcohol dependence-in remission. Other alcohol dependence-continuous 
18: Schizophrenia-Paranoid, Long-term use meds. Schizophrenia, Schizophrenia-paranoid-chronic, Drug monitor 
19: Female breast cancer. Personal history of breast cancer. Lymph cancer. Carcinoma in situ of breast, lymphedema 

20: Child health exam. Vaccination for disease. Vaccinations against pneumonia. Need for prophylactic vaccination against viral hepatitis. Procedure 


Table 1: The top twenty components on the Mayo Clinic data. We find that each factor forms a medically meaningful grouping of 
diagnosis codes. For example, there are allergy, pregnancy, and alcohol dependence components. 



Data 

HOP 

Scaled HOP 

CNPF 

Softplus-CNPF 

NYT 

3570 

3283 

2755 

2768 

Mayo Clinic 

1251 

877 

779 

780 

ArXiv 

5713 

4076 

2107 

2120 


Table 2: A summary of the predietive results on The New York Times, the Mayo Clinic and ArXiv 
clicks. The correlated models outperform both the uncorrelated models. Adding per observation 
scalings improves predictions. 

For the models we analyze, we compute this metric as follows. For each held out data point 
we hold the components fixed (i.e., Eg[afc]) and use the 10% of observed columns to form a vari¬ 
ational expectation of the per-data point weights Eq[xik\. In all models, we compute the held out 
probability of unobserved columns by using the multinomial conditioning property of Poissons. 
Conditional on there being a test observation, it is assigned to a particular column (e.g., a word or 
a diagnostic code) with probability equal to that column’s normalized Poisson rates. Formally, 

( _ [^kj] 

We measure the probability of the |/test columns under this distribution. This evaluates how well 
the discovered components can form predictions in new and partially observed observations. 

Hyperparameters. We set the hyperparameters on the base distribution to have shape .01 and 
rate 10.0. We set the truncation level T to be 200, and found that none of the studies required more 
than this. We set the dimensionality of the latent locations to be 25 and the prior variance to be 
We keep these hyperparameters fixed for all data. 

In the algorithm, we use Robbins Monro learning rates, (50 +for the text data and (100 + 
for the medical codes, and click data. 

6.2. Results 

We evaluate our posterior fits on text, medical diagnosis data, and click data. 

The New York Times. We study a large collection of text from The New York Times. Rows are 
documents; columns are vocabulary words; the cell is the number of times term j appeared in 
document i. After preprocessing, the data contains 100,000 documents over a vocabulary of 8,000 
words. Analyzing text data with a Poisson factorization model is a type of topic modeling (Blei, 
2012 ). 

Table 2 summarizes the held-out perplexity. We find that the correlated model outperforms 
both of the uncorrelated models. Note that even in the uncorrelated model, adding a per-document 
scale parameter improves predictions. 

The model also provides new ways to explore and summarize the data. Figure 1 is a graph of 
the positive correlation structure in the posterior for the top fifty components, sorted by frequency; 
Table 3 contains a list of the top ten negative correlations. To explore the data, we compute corre¬ 
lations between components by using their latent locations through the covariance function of the 
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Israel, Israeli, Palestinian, Jewish, peaee 
league, players, sports, baseball, team 

room, bedroom, bath, taxes, market 
news, book, magazine, editor, books 

war, iraq, military, army, iraqi 
spaee, kim, koeh, moon, nasa 

roek, musie, band, Jones, album 
family, tax, board, paid, friend 

water, plant, garden, plants, trees 
union, soviet, moseow, russian, gorbaehev 

island, water, beaeh, river, sea 
theater, broadway, play, show, produetion 

building, housing, buildings, square, projeet 
news, book, magazine, editor, books 

bush, administration, elinton, offieials, house 
spaee, kim, koeh, moon, nasa 

room, bedroom, bath, taxes, market 
indian, atlantie, easino, trump, las 

eentury, small, wine, plaee, white 
eontraet, los, angeles, league, ehieago 

Table 3: The top ten pairs of negatively eorrelated eomponents inferred from the New York Times. 
Eaeh pair of eomponents are highly unlikely to eooeeur in an artiele. 

Gaussian proeess. For these fits, the eovarianee between ^i and is the eorrelation between 
two eomponents is thus 

P 

_ 

We find the eorrelation struetures eontains highly eonneeted groups, eonneeted to eaeh other by 
“glue”, individual eomponents that bridge larger groups. For example the bottom left eonneeted 
group of “international polities" is glued together with the top left group of “finanee" through the 
“politieal parties" eomponent and the “law” eomponent. 

As we said above, we set the truneation level of the approximate posterior to 200 eomponents. 
Figure 2 plots the atom weights of these 200 eomponents, ordered by size. The posterior uses 
about 75 eomponents; the truneation level is appropriate. 

Medical history from the Mayo Clinic. We study medieal eode data from the Mayo Clinie. 
This data set eontains of all the International Classifieation of Diseases 9 (ICD-9) diagnosis eodes 
(also ealled billing eodes) for a eolleetion of patients over three years. The diagnosis eodes mark 
medieal eonditions, sueh as ehronie isehemie heart disease, pure hypereholesterolemia, and type 
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Figure 2: The weights in eaeh tuple on the New York Times ordered by magnitude. Around 75 
components are used. 



Figure 3: The weights in each tuple ordered by magnitude. Around 50 of components are used. 
Though similar in size to the NYT data set, fewer components are used. The component usage has 
a steeper decline. 
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- Long-term anticoagulants, Atrial fibrillation, Hypertension, Congestive Heart Failure, Chronic airway obstruction 

- Heart valve replacement. Prostate cancer. Lung and bronchus cancer. Secondary bone cancer. Other lung disease 

- Abnormality of gait. Personality change. Persistent mental disorders. Lack of coordination. Debility 

- Schizophrenia-Paranoid, Long-term use meds. Schizophrenia, Schizophrenia-paranoid-chronic, Drug monitor 

- Attention deficit disorder with hyperactivity. Attention deficit disorder without hyperactivity. Adjustment 

disorder with disturbance of emotions and conduct. Opposition defiant disorder. Conduct disturbance 

- Schizophrenia-Paranoid, Long-term use meds. Schizophrenia, Schizophrenia-paranoid-chronic, Drug monitor 

- Depression, Dysthymia, Anxiety state. Generalized anxiety disorder. Major depressive affective disorder 

- Alcohol dependence. Tobacco use disorder. Alcohol abuse. Other alcohol dependence-in remission. 

Other alcohol dependence-continuous 

- Long-term anticoagulants. Atrial fibrillation. Hypertension, Congestive Heart Failure, Chronic airway obstruction 

- Diabetes-2, Hypertension, Hyperlipidemia, Uncontrolled Diabetes-2, Diabetes-2 with ophthalmic manifestations 

- Hypertension, Hyperlipidemia, Coronary atherosclerosis. Prostate Cancer Screening, Vaccine for influenza 

- Long-term anticoagulants. Atrial fibrillation. Hypertension, Congestive Heart Failure, Chronic airway obstruction 

- Heart valve replacement. Prostate cancer. Lung and bronchus cancer. Secondary bone cancer. Other lung disease 

- Female breast cancer. Personal history of breast cancer. Lymph cancer. Carcinoma in situ of breast. Lymphedema 

- Depression, Dysthymia, Anxiety state. Generalized anxiety disorder. Major depressive affective disorder 

- Schizophrenia-Paranoid, Long-term use meds. Schizophrenia, Schizophrenia-paranoid-chronic, Drug monitor 

- Diabetes-2, Hypertension, Hyperlipidemia, Uncontrolled Diabetes-2, Diabetes-2 with ophthalmic manifestations 

- Schizophrenia-Paranoid, Long-term use meds. Schizophrenia, Schizophrenia-paranoid-chronic, Drug monitor 

- Mammogram, Routine medical exam. Lumbago, Cervical Cancer Screening, Hypothyroidism 

- Female breast cancer. Personal history of breast cancer. Lymph cancer. Carcinoma in situ of breast. Lymphedema 


Table 4: The top ten correlations among the heavily used components in the Mayo Clinic data. 
We find several medically meaningful relationships between latent components. For example, the 
relationships between obesity and type 2 diabetes is well established. 

2 diabetes. The entire collection contains 142,297 patients and 11,102 codes. Patients are rows in 
the matrix; codes are columns; each cell marks how many times the patient was assigned to the 
code. 

Table 2 summarizes the held-out perplexity. Again, the correlated model does best. Further, as 
for text modeling, it is important to allow a patient-specific scale parameter to capture their relative 
health. Figure 3 plots the posterior sticks, ordered by size. The approximate posterior uses about 
50 components, using the first 20 more heavily. 

Table 1 contains the 20 most commonly used components. The components correspond to med¬ 
ically meaningful groups of conditions, such as obesity (12), type 2 diabetes (7), and breast malig¬ 
nancy (19). The top positive correlations are in Table 4. There are several meaningful correlations, 
such as depression & alcohol dependency, and using anticoagulants & hypertension/lipidemia. 
Note that the relationship between schizophrenia and type 2 diabetes is an active area of research 
in medicine (Suvisaari et ah, 2008; Liu et ah, 2013). 
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ArXiv click data. Finally, we examine user eliek data from the ArXiv, an online repository of 
researeh artieles. The ArXiv initially foeused on physios artioles but has now expanded to many 
other domains, inoluding statistios. This data set oontains the number of times eaeh user olioked 
on an artiole; it spans 50,000 users and 20,000 artioles. Building models of suoh data is useful, for 
example, to develop reoommendation systems that find interesting artioles to ArXiv readers. 

As for the other data, we hold out some of the olioks and try to prediot them. Table 2 sum¬ 
marizes the results. We find similar results as on our other two data sets. The oorrelated models 
outperform the uneorrelated models on predioting unseen elioks for new users. We find that the 
standard CNPF model outperforms the softplus CNPF on all of our data sets. 

7. DISCUSSION 

We present oorrelated random measures. Correlated random measures enable us to oonstruot ver¬ 
sions of Bayesian nonparametrio models that oapture oorrelations between their oomponents. We 
oonstruot several examples of suoh models, and develop the inferenoe algorithm in detail for one of 
them, oorrelated nonparametrio Poisson faetorization. With this model, we find that the oorrelated 
random measure improves prediotions and produoes interesting interpretable results. 

Random probability measures sueh as the Diriehlet prooess are eonsistent for density estima¬ 
tion, so why might one prefer a oorrelated random measure over a eompletely random measure? 
We eonjeeture that oorrelated random measures make more effioient use of the data. One promis¬ 
ing avenue of future researeh is to study the rates of oorrelated random measures versus eompletely 
random measures. 

Correlated random measures model latent oorrelations in the data, while dependent random 
measures model oorrelations based on observed oovariates. Combining these two ideas to inoorpo- 
rate oorrelations both observed and latent yields a broader olass of random measures that oan model 
many real world phenomena. Another avenue of future researeh is to study this oonstruotion both 
methodologioally and praotieally. 

We define oorrelated random measures by eombining Poisson and Gaussian prooesses. How¬ 
ever, we note that other prooesses ean also be used for the souroe of tuples {ai,Wi,ii) and the 
random funetion. For example, the DILN model of Seetion 4.3 uses a Diriehlet proeess to form its 
tuples; another way to generate tuples would be through the Pitman-Yor proeess (Pitman and Yor, 
1997; Teh et ah, 2006). 

Similarly, though we used a Gaussian prooess to define a random funetion from latent loeations 
to real values, there are other possibilities. For example we oan replaoe the GP with the student-T 
prooess (Shah et ah, 2014). Or, if we restriot the latent loeations to be positive then we oan use 
them to subordinate, as an index to, another stoohastio prooess, sueh as Brownian motion. Also, we 
oould use disorete random funotions to form feature groups. We leave these extensions for possible 
future researeh. 
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A. APPENDIX 

In the appendix we deseribe the Laplaee transform of CorrRMs, establish eonditions for integra- 
bility, derive inferenee for eorrelated nonparametrie Poisson faetorization, and show the ehanges 
to inferenee needed for variational inferenee in our comparison models. 

A.l. Laplace Transform 

We ean eharaeterize the Laplaee funetional of a Poisson driven eorrelated random measure in terms 
of a Gaussian expeetation. 

Proposition 1. Let M be drawn from a correlated random measure with Poisson mean measure v, 
Gaussian process parameters m and K with Gaussian process draw F, transformation distribution 
p(x\-), and let g be a positive, real valued, £ measurable function, then the Laplace functional 
£'[gMg] ^ where v = v(da,dw,d€)p(dx\F(tf),wf). 

The Laplaee funetional ean be used for analytie eomputation of properties of eorrelated random 
measures as it eharaeterizes all moments of integrals with respeet to this random measure. 
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Proof of Proposition 1. Let (a*, Wj, be the atoms of the Poisson random measure drawn 
with mean u. Then, conditional on F by the transformation property of Poisson random measures 
(oj, Wi, ii, Xi) is a Poisson random measure. This follows as given F, tUj, ^i, the Xi are conditionally 
independent (Kingman, 1993). The mean measure of this Poisson random measure given F is 

= ^{da, dw, di)p{dx\F(ii), Wi). (12) 

The correlated random measure M can be written as integral with respect to the Poisson random 
measure N with mean u as Nf{a)x. Thus, 

E[e-"^^] =E[E[e-"^3|F]] = 

where the last step follows from the Laplace functional of a Poisson process. 

A.2. Integrability 

Establishing conditions for integrability with respect to the random measure aids in the construc¬ 
tion of models (consider the aforementioned correlated probability measures). Here we provide 
a proposition (using the notation x A y is the smaller of x and y) that completely characterizes 
integrability of positive functions with respect to a Poisson driven correlated random measure. 

Proposition 2. Let M be drawn from a correlated random measure with Poisson mean measure 
u, Gaussian process parameters m and K with Gaussian process draw F, transformation distri¬ 
bution p{x\-), and let g be a positive, real valued, £ measurable function. Then, Mg is finite with 
probability VF{bg{a)x A 1 < cxd) , where z> = v{da, dw, di)p{dx\F{ii),Wi). 

Note the probability is a Gaussian expectation. This proposition parallels the integrability 
conditions based on the mean measure for Poisson random measures (Cinlar, 201 1). We use this 
proposition to establish finiteness in our examples. 

Proof of Proposition 2. We first begin by noting that 

V{Mg <oo) = limEe-^"^® = limE[e-"(i-^“’'^‘“’^)] = E[lime-"(i-^-’"^'“'^)], 

r^O r^O r^O 

by Proposition 1 and where the last equality follows from the dominated convergence theorem and 
the positivity of rg{a)x. 

The function g{a)x A 1 dominates (1 — for r < 1, thus when i>g{a)x Al < oo, then 

limr^o Similarly (1 — dominates (1 — e~^){g{a)x A 1), thus when 

i>g{a)x Al = oo, limr_j.o ^ Putting this all together gives 

E[lim ^ E[6{i'g{a)x A 1 < oo)] = VF{bg{a)x A 1 < oo). 

r—^-O 

Thus V{Mg < oo) = VF{bg{a)x A 1 < oo). 

We can use Proposition 2 finiteness of M{E) by letting g equal to 1 everywhere. That is, 

V{M{E) < oo) = VFipx A 1 < oo). 

Both propositions naturally extend to the hierarchical case when the shared tuples come from 
a Poisson process. 
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A.3. Finiteness 


Finiteness of the CorrRM in correlated nonparametric Poisson factorization. The Poisson 
rate is an inner produet between iid gamma variables and a draw from a eorrelated random measure. 
Thus the Poisson rate is finite almost surely if draws from the eorrelated random measure produee 
finite measures almost surely. We show this using Proposition 2. 

In this ease the mean of the conditional Poisson random measure is 

, -FU) 

i'(d£,da,dw,dx) = L(di)H(da)e ‘^‘^/wdw—— — —x'^ dx. 

T{w) 

We seek to integrate x A 1 with respect to this measure. Using a change of variables, this integral 
is equal to 


X A lL{di)H{da)e /wdw- 




T{w 

< [ L{d£)H{da)e~^^/wdwx 




, Fro 


Tiw) 


= H{E) j L{d£)e-^^e^^^^dw 
= [ e^^^^L{di). 


Thus the Poisson rate is finite when J e^^^"'L{di) < oo. By Tonelli’s theorem and assuming G has 
a linear kernel and a constant mean /r which has a mean zero and cr^ variance prior, 


E[ j e^^^^L{di)] = J 


elKmL{di) + exp 



If is bounded, then the measure is finite almost surely regardless of the density L{di). The 

linear covariance function is unbounded. In this case, from the above equality we have 


r 1 

E[ J = - 




< OO, 


for < 1. Putting this all together means that the Poisson rate is almost surely finite for linear 
kernel when the locations are drawn from an isotropic Gaussian with variance less than 1. The 
same conditions transfer to the softplus variant as exp(x) > log(l + exp(x)). 


Finiteness of the beta-Bernoulli correlated latent feature model. We use Proposition 2 to 
establish finiteness of this measure. We note that finiteness in the number of features follows 
from summability of the probability that each feature is on. The mean of the conditional random 
measure of the probabilities is 

z>(d£, da, die, dx) = L{di)H{da)aw~^{1 — = a{a~^{w) + F{i))}dwdx. 
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The integral of a; A 1 is the same as x, as x is bounded by 1 due to the logistic function. Thus, 


j xL{di)H{da)aw ^(1 —tu)" ^l{xi = a(a ^{w) + F{i))}dwdx 
= H{E) f + F{i))dwL{di). 


We assume that H{E) is a finite, like in a probability distribution. Thus the finiteness of this 
quantity only depends on the interior integral. We can split this integral over each half of the unit 
interval. The integral of the second half is 



aw ^(1 — ^a{a (w) + F{i))dwL{di) 


< 2 / a{l — wY dw = 


2(a- 1) 


< CX). 


This means finiteness only depends on the integral with respect to the first part of the unit interval. 
The integral over the first half is 



aw 


-1 


(1 


w 


,a-l 


O' 


a ^{w) + F{l))dwL{di) 



aw — to) 


0 — 1 



q;(1 — w) 


0-2 


1 + 

W 

1 


dwL{di) 


- + e-Fi) 

1 — W 


dwL{di) 


< / e^^^^L{dl) / a{l - w)^-^dw = C / 


for some constant C. Following the same argument for CNPF above, this means the measure is 
finite when J e^^^^L{di) is finite. The measure is almost surely finite for a linear kernel when the 
locations are drawn from an isotropic Gaussian with variance less than 1 and for bounded variance 
covariance functions. 


Finiteness of the gamma-Bernoulli correlated latent feature model. The sum of the activation 
probabilities can be given as 


^ A ^ Wiexp{F{ii)) 

^ Wiexp{F{ii)) + 1 

OO 

< ^w*exp(F(fi)). (13) 

i=l 

This is the same as the finite measure condition in CNPF. Thus, the measure is almost surely finite 
for a linear kernel when the locations are drawn from an isotropic Gaussian with variance less than 
1 and for bounded variance covariance functions. 
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A.4. Variational Inference for CNPF 

Variational inference for CNPF maximizes the Evidence Lower BOund (ELBO). The full ELBO 
for CNPE is 


T 

C =Eg[logp(a)] +Ejlogp(c)] +Ejlogp(s|a,c)] + ^ Ejlogp(r;fc|a)] 

k=l 

T I 

+ Eg[logp(4)] + ogp{aki) - logg(afci)] 

k=l i=l 

U T 

+ ^Eg[logp(d„)] +Ejlogp(/i„)] + '^Eg[\ogp{xku\du,Vk,s,ik) -logg(a;fc„)] 

u=l k=l 

I 

+ '^Eq[logp{yuik\xku,aki,yui) - logq{yuik)], (14) 

i=l 

where the ELBO for CNPE is augmented with an auxiliary variable yui to allow for analytic updates 
similar to Dunson and Herring (2005), Zhou et al. (2012), and Gopalan et al. (2014). We now 
describe its role. In variational inference, the update for a latent variable depends on the variational 
expectation of terms in the joint distribution where that variable appears (Bishop, 2006). The 
update of the components of the base measure depends on the observation log-likelihood 

k k 


The second term, 


yuiEg[logC^Xkuaik)], (15) 

k 

does not have analytic form when Xku and atk are gamma distributed. To address this, we de¬ 
compose the Poisson observation into a sum of Poisson variables. Prom the additivity of Poisson 
random variables, the Poisson observation in CNPE is equivalent to 

OO 

Vui ^ ^ Vuiki Vuik ~ PoiSSOn(37^.^Clj/j), 
k=l 

with the auxiliary jjuik marginalized out. The rate of these auxiliary Poisson is no longer a sum, so 
the variational expectations are tractable. 

In mean field variational inference, the update to the approximating family of a latent variable 
depends on the distribution of that latent variable conditional on everything else (Bishop, 2006). 
Conditional on yui, a, z, the vector yui is multinomially distributed (Zhou et al., 2012) as the 
following 


yui\yui,a,x ~ Mult( 


^ku^ik 


E oo 

k=l ^ku^ik 


)■ 
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We introduce these auxiliary variables for only those observations that are nonzero as Eq.l5 is zero 
for zero observations. 

In CNPF, there are global latent variables which are the set of global tuples, and local latent 
variables which are the correlated random measure associated with each patient. We define an 
equivalent objective in terms of just the global latent variables G by maximizing over the per- 
patient latent variables. 


= max©£(G', 0) = C{G, 0*(G)), 


where 0*(G) is the setting of the per-patient parameters that maximizes the ELBO given the global 
parameters G. To maximize we need to compute its gradient. By the chain rule, 


dC^ dC 


f)r rir 

(G, 0*(G)) + 0*(G))^^^(G) = 0*(G)). 


dG dG 


Thus in words, the gradient of the objective parameterized by just the global variables is the 
gradient of the original objective evaluated at the maximizing per-patient parameters given the 
global variables. This yields a mixed coordinate ascent/gradient ascent maximization for this ob¬ 
jective that allows for parallel computation across patients. We will now detail all of the global 
gradients followed by the local coordinate updates. 

Global gradients. Given the optimal variational parameters for each of the patients, we give 
the gradients of the variational parameters shared across patients. The global gradients may be 
prescaled by a positive definite matrix (preconditioner) for efficiency. 

Natural gradient of and The variational approximation for a^j, the positive condition 
weight in each component, is the same family as the prior, the gamma distribution. Here and 
represent the shape and rate of the approximation respectively. We compute natural gradients, 
which are gradients scaled by the inverse Fisher information matrix of the variational approxima¬ 
tion. These gradients have been shown to have good computation properties (Sato, 2001; Honkela 
et ah, 2008; Hoffman et ah, 2013). The natural gradient with respect to and (3^^ are 




From this equation computing the gradients require iterating over the entire observation matrix. For 
large, sparse observation matrices, this is inefficient. We rewrite the gradient in terms of nonzero 

Vui as 



u-yui>o 
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where we note X]n=i the same across all i. 


Gradient of s and 14. We define the following quantity that will be useful in writing the gradient 
for both s and 14 - 

dwku = -'h(wfc) + - \og{/3lJ - djik - iSu- 

Given this, the gradient of the rate of the gamma process is given by 

dC a. — 1 


ds 


U T 

-4- CS-}^}_^dWkuY^ 

u=l k=l 


and the gradient of the sticks is 

dC I — as 


U T 


/ , dwju - - 


Positivity constraints are handled by transforming to the inverse softplus (log(l + exp(a;))) space 
where the parameters are unconstrained. The gradient in this space follows directly from the 
previous equations and the chain rule. We handle all future positivity constraints in a similar 
manner. We handle the unit interval constraint on 14 with the inverse logistic transformation. 


Gradient of ik. The gradient of the locations that define the correlations is given by 


dik 


u 


a. 


^ U=1 


du -Wk + 


a 


ku 


T 


/3^^exp{du ik + ^d 


We use the negative Hessian as the predconditioner matrix. The Hessian of the ELBO with respect 
to locations is 


u 


--h-y^dA 

rTi 


a 


ku 


T 


dikdik exp{du ik + A 


(16) 


Gradient of a. The gradient of the base mass of the gamma process is given by 

da 


log(c) - (T + 1)^(4) + T^(q; - 1) + log(s) + 'y log(l - 14) + («« - 1) log(Q;) - ba, 


k=l 


where Oq, and ba are respectively the shape and rate of the hyperprior. We set Oq, to 1 and ba to 

0 . 01 . 


Gradient of c. The gradient for the point estimate of the gamma process rate is given by 

=-s + (Oc - 1) log(c) - be 

oc c 

where Oc and be are respectively the shape and rate of the hyperprior. We set both parameters to 
the same values as the prior on a. 
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Coordinate updates. To find the optimal per patient variational parameters, we iterate between 
eoordinate updates. 

Coordinate update of y^i. The variational distribution on veetor of auxiliary variables jjui is 
the multinomial distribution. The veetor (pui is the veetor of probabilities to this multinomial 
distribution. The optimal variational parameters given the rest of the model is given by 

Kik « exp(^(a^„) - log(/?^„) + - log(/3^i)). 

We again note that we only introduee auxiliary variables for yui that are nonzero. 

Coordinate update of Xku- We let the variational distribution over Xku be the gamma distribu¬ 
tion. The eoordinate updates for the shape of this variational family is 

^ku T ^ ^ ^ui4^uiki 


and the rate is 


i=l Pki 

We note that the sum over eonditions ean be eomputed onee and shared aeross all patients. 

Coordinate update of du and There is no simple elosed form solution for the eoordinate 
update of du- Instead, we use gradient aseent. The gradient of the ELBO with respeet to du is 
given by 


dC 

ddu 


— —du + ^k I —Wk + 


a 


ku 


T 


k=l 


/3^^exp{du ik + idu) 


and the gradient for the shared Gaussian proeess mean is 

T 


dC 


yu \ ^ . 

= -^ + 2_^-Wk + 


a 


ku 


k=l 


ldl^ex-y{du 4 +/i«) 


We terminate the proeedure when the ehange in du between steps falls below a threshold or a 
maximum number of iterations is reaehed. 


Variational inference. Algorithm 1 presents a variational inferenee algorithm using the gra¬ 
dients and eoordinate maximization proeedures derived in the previous seetion. For the global 
gradients without preeonditioners, we use RMSProp (Tieleman and Hinton, 2012). RMSProp is 
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Algorithm 1: Variational Inference for CNPF 
Input: data x. 

Initialize G randomly, t = 1. 

repeat 

for in parallel m = 1 to t/ do 

Optimize given G. 

end for 

Follow preconditioned update for G. 
until validation perplexity stops improving. 


a per-component learning rate, which can be viewed as multiplication by a diagonal matrix. For¬ 
mally, if gt is the gradient at iteration f, r is a number in the unit interval, and rjt is a scalar, then 
the RMS preconditioner pt can be computed as 

9t = (1 - ^)9t-i + ^diag(c/iC/7) 


Intuitively, RMSProp accounts for length scales and in the noisy setting takes smaller steps along 
noisier coordinates. 

As the optimization problem for each 0„ is independent given the global parameters G, we can 
parallelize this step. In all of our experiments we parallelize the maximization step across forty 
cores. We assess convergence using predictive perplexity on a held out collection of patients. 

Stochastic variational inference. The variational inference algorithm presented in Algorithm 1 
computes the optimal local variational parameters for each patient before updating the variational 
parameters for the random variables shared across patients at each iteration. As the number of 
patients grows large, this computational cost of this becomes prohibitive. To remedy this malady, 
we turn to stochastic variational inference (Hoffman et ah, 2013). 

Stochastic variational inference works by performing stochastic optimization (Robbins and 
Monro, 1951) on the variational objective. Stochastic optimization maximizes an objective by 
following a noisy gradient which is unbiased (in expectation is the true gradient). 

In stochastic variational inference, the noise stems from subsampling datapoints. This leads to 
quicker updates as the noisy gradients are based on a fraction of the entire objective. Consider a 
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Algorithm 2: Stochastic Variational Inference for CNPF 
Input: data x 

Initialize G randomly, t = 1. 

repeat 

Draw d ~ Unif(l, U). 

Optimize 0^ given G. 

Follow preconditioned update for G with stochastic gradients, 
until validation perplexity stops improving. 


patient u, then define the following objeetive 


T 

=Eg[logp(a)] +Ejlogp(c)] +Eg[logp(s|a,c)] + ^Ejlogp(r;fc|a)] 

k=l 

T I 

+ Eg[logp(4)] + EEe g[\ogp{aki) - logg(afci)] 

k=l i=l 

T 

+ U{Eg[logp{du)] + Eg[\ogp{pu)] + '^Eg[\ogp{xku\du,Vk,s,ik) -logg(a;fc„)] 

k=l 

I 

+ '^Eg[logp{yuik\xku, aki) - \ogq{yuik)])- (IV) 

If we let u ~ Unif(l, U), then Ed[C^] = C. Thus, gradient of where u is uniformly drawn 
from 1 to f/ is an unbiased gradient. The gradient of is eomputed by finding the loeal optimal 
parameter for the patient u and sealing it aeeording to the total number of patients. This objeetive 
and noisy gradient generalizes in a straightforward manner to drawing small batehes of patients. 

Computationally, stoehastie variational inferenee provides an advantage over Algorithm 1, as 
the slow part of Algorithm 1 for large datasets is computing the optimal loeal parameters for every 
single datum. Algorithm 2 summarizes stoehastie variational inference for the CNPE model. 

A.5. Stochastic Variational Inference for Baselines 

Both the HOP and the uneorrelated models are restrietions of CNPF The HOP is a restrietion of 
CNPF when the loeations I and sealings /i„ are set to zero, while the uneorrelated model only 
restriets the loeations to be zero. This means the only ehange required in inferenee is to fix the 
respeetive parameters to zero depending on whether inferring the HOP or the sealed HOP 

A.6. Stochastic Variational Inference for Softplus CNPF 

The transformation distribution for Softplus CNPF is 

Xui ~ Gamma (wi, -———• 

V log(l + exp{F(fi)})y 
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This means the variational updates for w, i, x, d, and m will be different. Define guk to be du ik + 

Softplus CNPF dw. The gradient of the weights is 

dwku = - log{/3^J - log(log(l + exp(^„fc))). 

Gradient of ik. Recall a is the logistic function. The gradient of the locations is given by 


dC 


1 ^ ^ 

^ di, 


(^{9 


uk ) 


9ik ^ log(l+ exp(c/„fc)) 


-Wk + 


a 


ku 


/^Llog(l + exp(^«fc)) 


Coordinate update of Xku- The coordinate updates for the shape of Xku is 


= Wk 


+ E ^ui4^uik') 




and the rate is 


l^tu = log(l + exp(^„fc)) + E 

i=i Pki 


Coordinate update of du and /i„. The gradient of the ELBO with respect to du is given by 

T 


dC 


dd,, 


~du + 'y ^ ' 




uk ) 


^ log(l+ exp(^„fc)) 

and the gradient for the shared Gaussian process mean is 

T 


-Wk + 


cr 


ku 


PL'^og{l + exp{guk)) J ’ 


dC _ 

rr2 A—/ 


(^{g 


uk ) 


-Wk + 


a 


ku 


^ log(l+ exp(^„fc)) V l3L'^og{l + exp{guk)), 

We terminate the procedure when the change in du between steps falls below a threshold. 
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